Libraries

Dataset

Raw data: consists of train and test sets. The raw data does not include label information. Due to privacy concerns, the feature names are masked except for the first four columns. These are:

Pandas Profiling: Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

As we can see in the above report, there is no missing data in any of the columns.
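The missing-data check can be reproduced without the full profiling report. This is a minimal sketch using a toy DataFrame (the variable `df` and the `Var_*` column names stand in for the masked competition data):

```python
import pandas as pd

# Small illustrative DataFrame standing in for the (masked) competition data.
df = pd.DataFrame({
    "Var_1": [0.2, 0.5, 0.1],
    "Var_2": [1, 3, 2],
})

# Count missing values per column; all zeros confirms there is no missing data.
missing_counts = df.isnull().sum()
print(missing_counts)
```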

PREPROCESSING

As we can see in the above report, Var_39 and Var_53 consist of boolean values, so I applied an encoding step to convert them to numeric form.
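The encoding step can be sketched as follows. This is an assumption about the exact method used; casting the boolean columns to integers is one straightforward way to do it:

```python
import pandas as pd

# Toy frame with boolean columns like Var_39 / Var_53 from the report.
df = pd.DataFrame({"Var_39": [True, False, True],
                   "Var_53": [False, False, True]})

# Encode booleans as 0/1 integers so every model can consume them.
for col in ["Var_39", "Var_53"]:
    df[col] = df[col].astype(int)
```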

Preprocessing data:

We used MinMaxScaler to scale the numerical features into the range (0, 1), so that they match the range of our encoded categorical variables.
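A minimal sketch of the scaling step, on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [5.0, 400.0],
              [9.0, 800.0]])

# Rescale each feature column into the range (0, 1).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```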

Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. To select the best features, we implemented two different methods, which are "KBest" and "Recursive Feature Elimination".

Select K Best (chi2 score func)

This method selects the k features with the highest Chi-squared scores, where k is the number of features you want to keep. We create a SelectKBest object, then fit and transform the classification data.
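The fit-and-transform step looks like this. The sketch uses the iris dataset as stand-in data, since chi2 requires non-negative features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the k=2 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
```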

There is a drastic change in the importance of some features after scaling, which is interesting. We prefer the feature-importance results computed on the scaled data.

Let's drop the n least important features from our data:

Recursive Feature Elimination

The Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

The example below uses RFE with the random forest algorithm to select the top 30 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.
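A self-contained sketch of this step on synthetic data (the real code would select 30 features from the competition data; here we select 5 from 10 to keep the example small):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=20, random_state=42),
          n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
```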

We then applied RFE with logistic regression and with gradient boosting, respectively.

Evaluation Functions

We may want to fit the models with balanced sample weights, since this problem suffers from class imbalance (roughly 3:1).

That's why our get_score function includes an option to use sample weights when fitting the models.

It returns the scores obtained from 5-fold stratified cross-validation as a DataFrame.
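A helper with this behavior could look like the sketch below. The name `get_score` and its signature are assumptions reconstructed from the description above, not the original code:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight

def get_score(model, X, y, use_sample_weight=False):
    """Hypothetical helper: 5-fold stratified CV scores as a DataFrame."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    rows = []
    for fold, (tr, te) in enumerate(skf.split(X, y)):
        fit_kwargs = {}
        if use_sample_weight:
            # Balanced weights counteract the ~3:1 class imbalance.
            fit_kwargs["sample_weight"] = compute_sample_weight("balanced", y[tr])
        model.fit(X[tr], y[tr], **fit_kwargs)
        rows.append({"fold": fold,
                     "accuracy": accuracy_score(y[te], model.predict(X[te]))})
    return pd.DataFrame(rows)

# Usage on an imbalanced synthetic dataset:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, weights=[0.75, 0.25], random_state=0)
scores = get_score(LogisticRegression(max_iter=1000), X, y, use_sample_weight=True)
```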

Models

We can see which variables are more important in the tree learning method from the above figure.

Random Forest

Random Forest is a method that ensembles basic decision tree models.

This time, we repeated the same steps for the Random Forest method.

Optimizations

Gradient Boost

In Gradient Boosting, each predictor tries to improve on its predecessor by reducing its errors. Instead of fitting a new predictor to the data at each iteration, it fits the new predictor to the residual errors made by the previous predictor.
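The residual-fitting idea can be demonstrated with a hand-rolled boosting loop on a regression toy problem (squared loss, so the gradient is simply the residual). This illustrates the mechanism, not the library implementation used in the project:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel()

# Hand-rolled gradient boosting: each new tree fits the current residuals,
# and its (shrunken) predictions are added to the ensemble prediction.
learning_rate = 0.1
pred = np.zeros_like(y)
trees = []
for _ in range(50):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)

mse_final = np.mean((y - pred) ** 2)
```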

We followed the same steps for GradientBoostingClassifier, XGBoost, and CatBoost.

The main difference for Gradient Boosting is the way we did the grid search, which was stepwise.

We tried the most important parameters first and searched the grid for a small number of parameters at a time.

In the end, we did a grid search over all parameters, but with a small range for each.
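The stepwise search described above can be sketched like this. The exact parameter grids are illustrative assumptions; the point is that each stage fixes the winners of the previous one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Step 1: tune the most influential parameters first.
step1 = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    cv=3,
).fit(X, y)

# Step 2: fix the winners and search the next parameter over a small range.
step2 = GridSearchCV(
    GradientBoostingClassifier(random_state=0, **step1.best_params_),
    {"max_depth": [2, 3]},
    cv=3,
).fit(X, y)
```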

Feature selection seems to work well with 30 features.

We can continue with X_selected30.

Also, using sample weights when fitting our models works well.

Submission

Among the ML models we used, gradient boosting gives the best score.

Conclusion:

Many ML models could be developed for such a dataset, but instead of trying all of them, I tried the models that I thought would give good results. While trying the models, I also applied optimization and feature engineering.